fix: remove duplicate characters caused by fake bold rendering in PDFs#4215
fix: remove duplicate characters caused by fake bold rendering in PDFs#4215bittoby wants to merge 13 commits intoUnstructured-IO:mainfrom
Conversation
|
@badGarnet Could you please review this PR? Thanks! |
Thanks for contributing! I would suggest finding an example pdf that has this kind of issue and add a test using it. The code reads fine to me but it would be good to test on an actual file. |
|
@badGarnet |
| assert len(text_with_dedup) <= len(text_no_dedup), ( | ||
| f"Deduplicated text ({len(text_with_dedup)} chars) should not be longer " | ||
| f"than non-deduplicated text ({len(text_no_dedup)} chars)" | ||
| ) |
There was a problem hiding this comment.
a better assert would be:
- checking the exact expected text length
- check there is duplicated characters in the
text_no_dedup(likebboolldd) and normal text intext_with_dedupe(likebold)
diagnose_fake_bold.py
Outdated
| @@ -0,0 +1,69 @@ | |||
| """Diagnostic script to verify fake-bold PDF deduplication is working.""" | |||
There was a problem hiding this comment.
a test against the new file is good enough; we don't need to add a script to root dir for this case
…ix/remove-pdf-bold-text-duplication
|
@badGarnet Thanks for your feedback. I've updated. Could you please review again and confirm that it’s configured correctly according to your req? thanks again! |
…ix/remove-pdf-bold-text-duplication
29d32e5 to
355e925
Compare
|
Hi, @badGarnet . I updated all. Hope you merge this when you have a sec |
Head branch was pushed to by a user without write access
|
@badGarnet Thanks for approval. Can you merge the PR! |
|
Sorry for tagging you again, @badGarnet. I faced linting test error, so I updated the code and pushed a new commit. Could you please review it again and merge? Thanks. |
badGarnet
left a comment
There was a problem hiding this comment.
please update the changelog and move your entry to the appropriate section; please also bump the version number
|
I updated changelog and bumped version number |
CHANGELOG.md
Outdated
| - **Add `group_elements_by_parent_id` utility function**: Groups elements by their `parent_id` metadata field for easier document hierarchy traversal (fixes #1489) | ||
|
|
||
| ### Fixes | ||
| - **Fix duplicate characters in PDF bold text extraction**: Some PDFs render bold text by drawing each character twice at slightly offset positions, causing text like "BOLD" to be extracted as "BBOOLLDD". Added character-level deduplication based on position proximity. Configurable via `PDF_CHAR_DUPLICATE_THRESHOLD` environment variable (default: 3.0 pixels, set to 0 to disable)(fixes #3864). |
| @@ -1 +1 @@ | |||
| ## 0.18.35-dev0 | |||
There was a problem hiding this comment.
please bump here as well
|
Sorry, @badGarnet - I misunderstood. I’ve now updated CHANGELOG.md and bumped the version correctly. Could you please check again? Thanks for taking a look. |
|
@badGarnet Thanks for approving. 👍 Could you merge this PR? |
Closes #3864
Summary
Problem
When extracting text from certain PDFs, bold text appears duplicated:
Solution
Added character-level deduplication that:
Configuration